COMP47670 Assignment 2 Autumn 2023/24¶

This assignment's objective is to examine Berlin Marathon log data, with a particular emphasis on split and finish times. The analysis's goal is to comprehend how pacing tactics affect finish times. Strategies for pacing, such as even pacing, positive splits (first half faster), and negative splits (second half faster than the first), will be evaluated. Extreme positive splits, also referred to as "Hitting the Wall" or bonking, are also taken into consideration in this assignment. Getting information about pacing from the race data is the main goal.

Task 1. Data Preparation¶

Import Necessary Libraries¶

In [1]:
import pandas as pd
from functools import reduce
import plotly.graph_objects as go
import plotly.subplots as sp
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, ttest_ind
import warnings
import seaborn as sns
import plotly.express as px

Load Primary 2013 and Secondary 2022 dataset¶

In [2]:
primary_dataset = pd.read_csv('data/2013_data.csv')
secondary_dataset = pd.read_csv('data/2022_data.csv')
In [3]:
primary_dataset.head()
Out[3]:
id platz nation jahrgang ak ak_plazierung sex sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
0 152 22753 DEU 1953 60 403 M 19040 00:29:30 00:59:24 01:28:59 01:59:07 02:29:48 02:59:52 03:30:06 04:00:18 02:05:56 04:13:55 04:29:54
1 153 22051 DEU 1969 40 753 W 3511 00:29:34 00:59:35 01:29:19 01:58:52 02:28:32 02:58:41 03:28:16 03:58:22 02:05:14 04:11:39 04:27:31
2 154 14033 DEU 1963 50 1512 M 12510 00:26:11 00:52:58 01:19:12 01:45:43 02:12:12 02:39:02 03:05:54 03:36:41 01:51:26 03:50:28 04:02:01
3 155 11137 DEU 1960 50 1183 M 10103 00:26:08 00:52:12 01:18:07 01:44:13 02:10:49 02:37:45 03:04:08 03:31:11 01:49:47 03:42:32 03:54:12
4 156 23030 DEU 1956 55 1077 M 19234 00:31:37 01:01:45 01:30:54 02:00:43 02:30:46 03:01:38 03:31:15 04:02:06 02:07:09 04:14:51 04:30:48
In [4]:
secondary_dataset.head()
Out[4]:
id platz nation jahrgang ak ak_plazierung sex sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
0 3381 1 KEN 1984 35 1 M 1 00:14:14 00:28:23 00:42:33 00:56:45 01:11:08 01:25:40 01:40:10 01:54:53 00:59:51 02:01:09 02:01:09
1 3382 2 KEN 1988 30 1 M 2 00:14:22 00:28:56 00:43:35 00:58:14 01:13:07 01:28:06 01:43:25 01:59:05 01:01:26 02:05:58 02:05:58
2 3383 3 ETH 1997 H 1 M 3 00:14:50 00:29:46 00:44:40 00:59:40 01:14:44 01:30:01 01:44:55 02:00:03 01:02:55 02:06:28 02:06:28
3 3384 4 ETH 1998 H 2 M 4 00:14:16 00:28:23 00:42:33 00:56:45 01:11:09 01:26:11 01:42:14 01:59:14 00:59:51 02:06:40 02:06:40
4 3385 5 KEN 1994 H 3 M 5 00:14:22 00:28:55 00:43:35 00:58:14 01:13:07 01:28:03 01:43:08 01:59:14 01:01:25 02:06:49 02:06:49

Here, we can see the columns and their values respectively.

Note:

  • In 'ak' i.e Age Groups consist age from U20 = 18,19 and H = 20 - 29, 30 - 50, 50 - 80.
  • jahrgang - Birth year
  • halbmarathon - Half way time
  • netto - Finish time

Data Cleaning¶

  • Convert Time Columns - to convert time values in the format HH:MM:SS to seconds.
  • Handle Missing Values
  • Exclude 0 data values rows
In [5]:
def hh_mm_ss2seconds(hh_mm_ss):
    return reduce(lambda acc, x: acc * 60 + x, map(int, hh_mm_ss.split(':')))

def clean_load_df(path, time_columns, dataset_name):
    df = pd.read_csv(path, converters={col: hh_mm_ss2seconds for col in time_columns})
    df = df.apply(lambda x: x.str.upper() if x.dtype == 'object' else x)
    
    missing_values = df.isnull().sum()
    print(f"Missing values in {dataset_name} dataset:\n{missing_values}")
    df = df.dropna()
    
    return df

time_columns = ['z5', 'z10', 'z15', 'z20', 'z25', 'z30', 'z35', 'z40', 'halbmarathon', 'netto', 'brutto']

primary_dataset = clean_load_df('data/2013_data.csv', time_columns, dataset_name='primary')
secondary_dataset = clean_load_df('data/2022_data.csv', time_columns, dataset_name='Secondary')
Missing values in primary dataset:
id                0
platz             0
nation            0
jahrgang          0
ak                9
ak_plazierung     0
sex               0
sex_plazierung    0
z5                0
z10               0
z15               0
z20               0
z25               0
z30               0
z35               0
z40               0
halbmarathon      0
netto             0
brutto            0
dtype: int64
Missing values in Secondary dataset:
id                0
platz             0
nation            0
jahrgang          0
ak                7
ak_plazierung     0
sex               0
sex_plazierung    0
z5                0
z10               0
z15               0
z20               0
z25               0
z30               0
z35               0
z40               0
halbmarathon      0
netto             0
brutto            0
dtype: int64
In [6]:
primary_dataset = primary_dataset[primary_dataset['halbmarathon'] != 0]
secondary_dataset = secondary_dataset[secondary_dataset['halbmarathon'] != 0]
In [8]:
print("After Removing missing values from primary dataset:")
print(primary_dataset.isnull().sum())

print("\nAfter Removing missing values from secondary dataset:")
print(secondary_dataset.isnull().sum())
After Removing missing values from primary dataset:
id                0
platz             0
nation            0
jahrgang          0
ak                0
ak_plazierung     0
sex               0
sex_plazierung    0
z5                0
z10               0
z15               0
z20               0
z25               0
z30               0
z35               0
z40               0
halbmarathon      0
netto             0
brutto            0
dtype: int64

After Removing missing values from secondary dataset:
id                0
platz             0
nation            0
jahrgang          0
ak                0
ak_plazierung     0
sex               0
sex_plazierung    0
z5                0
z10               0
z15               0
z20               0
z25               0
z30               0
z35               0
z40               0
halbmarathon      0
netto             0
brutto            0
dtype: int64
In [9]:
print("Primary Dataset:")
primary_dataset.head()
Primary Dataset:
Out[9]:
id platz nation jahrgang ak ak_plazierung sex sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
0 152 22753 DEU 1953 60 403 M 19040 1770 3564 5339 7147 8988 10792 12606 14418 7556 15235 16194
1 153 22051 DEU 1969 40 753 W 3511 1774 3575 5359 7132 8912 10721 12496 14302 7514 15099 16051
2 154 14033 DEU 1963 50 1512 M 12510 1571 3178 4752 6343 7932 9542 11154 13001 6686 13828 14521
3 155 11137 DEU 1960 50 1183 M 10103 1568 3132 4687 6253 7849 9465 11048 12671 6587 13352 14052
4 156 23030 DEU 1956 55 1077 M 19234 1897 3705 5454 7243 9046 10898 12675 14526 7629 15291 16248
In [10]:
print("\nSecondary Dataset:")
secondary_dataset.head()
Secondary Dataset:
Out[10]:
id platz nation jahrgang ak ak_plazierung sex sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
0 3381 1 KEN 1984 35 1 M 1 854 1703 2553 3405 4268 5140 6010 6893 3591 7269 7269
1 3382 2 KEN 1988 30 1 M 2 862 1736 2615 3494 4387 5286 6205 7145 3686 7558 7558
2 3383 3 ETH 1997 H 1 M 3 890 1786 2680 3580 4484 5401 6295 7203 3775 7588 7588
3 3384 4 ETH 1998 H 2 M 4 856 1703 2553 3405 4269 5171 6134 7154 3591 7600 7600
4 3385 5 KEN 1994 H 3 M 5 862 1735 2615 3494 4387 5283 6188 7154 3685 7609 7609

Summary:

Here, I remove the missing values from the dataset (Primary2013 and Secondary2022) and exclude the 0 inpute records from the dataset respectively. I used time conversion function to convert time values in the format HH:MM:SS to seconds. This data cleaning process is needed for the further analysis.

Task 2. Data Characterisation¶

2.1¶

Summary statistics and Correlation matrix - Primary and Secondary Dataset¶

  • Summary Matrix:

    • For each variable, the summary matrix provides important statistics in a concise manner, such as the mean, standard deviation, and quartiles.
    • This helps with the first comprehension of the dataset's distribution and central tendencies by giving a brief summary of key features.
  • Correlation Matrix:

    • that shows the type and strength of relationships between variables.
    • It measures correlations with values ranging from -1 to 1: Perfect positive correlation is represented by a 1, perfect negative correlation by a -1, and no correlation is represented by a 0.
    • This matrix provides insights into variable interactions by helping in identifying of patterns and dependencies within datasets.
In [14]:
# Summary statistics    
summary_stats = primary_dataset.describe()
print(f"\nSummary Statistics for Primary dataset:")
summary_stats
Summary Statistics for Primary dataset:
Out[14]:
id platz jahrgang ak_plazierung sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
count 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000 36452.000000
mean 18865.656974 18281.564359 1970.093740 1631.857868 11508.597196 1660.198069 3336.205476 5005.339542 6700.602656 8422.562246 10214.364068 12023.310847 13899.495775 7071.863958 14719.763881 15652.852573
std 10779.508291 10554.363264 9.972185 1330.486694 8088.344200 244.063959 503.183120 771.816290 1033.117122 1355.438898 1653.749961 1989.313759 2353.713212 1087.890709 2468.412850 2933.840603
min 152.000000 1.000000 1928.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3692.000000 7403.000000 0.000000
25% 9527.500000 9142.750000 1963.000000 549.750000 4577.750000 1492.000000 2994.000000 4484.750000 5991.000000 7510.000000 9061.000000 10636.000000 12276.000000 6317.000000 12994.000000 13532.000000
50% 18869.500000 18280.500000 1970.000000 1244.500000 9303.500000 1654.000000 3321.000000 4972.500000 6645.000000 8342.500000 10081.000000 11824.000000 13617.000000 7008.000000 14381.000000 15411.000000
75% 28193.250000 27426.250000 1977.000000 2495.000000 18440.250000 1818.000000 3651.000000 5483.000000 7342.250000 9260.000000 11239.000000 13259.000000 15357.000000 7746.000000 16246.000000 17620.000000
max 37534.000000 36568.000000 1995.000000 5284.000000 27577.000000 3096.000000 5615.000000 8361.000000 11076.000000 14280.000000 17369.000000 20785.000000 23077.000000 11721.000000 25321.000000 26231.000000
In [15]:
# Correlation matrix
correlation_matrix = primary_dataset.corr(numeric_only=True)
print(f"\nCorrelation Matrix for Primary dataset:")
correlation_matrix
Correlation Matrix for Primary dataset:
Out[15]:
id platz jahrgang ak_plazierung sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
id 1.000000 0.015583 0.015831 0.002472 0.008980 0.018643 0.016722 0.017299 0.016094 0.017785 0.017186 0.017379 0.016477 0.017168 0.019194 0.019727
platz 0.015583 1.000000 -0.161033 0.471429 0.651280 0.897869 0.903838 0.903334 0.929208 0.920572 0.947441 0.954216 0.948403 0.936801 0.968782 0.969400
jahrgang 0.015831 -0.161033 1.000000 0.028964 -0.189074 -0.142190 -0.150286 -0.154424 -0.164262 -0.164859 -0.166782 -0.166461 -0.161973 -0.164773 -0.168933 -0.141555
ak_plazierung 0.002472 0.471429 0.028964 1.000000 0.772383 0.383733 0.381237 0.379681 0.388801 0.386208 0.405421 0.415887 0.428671 0.391872 0.438931 0.441337
sex_plazierung 0.008980 0.651280 -0.189074 0.772383 1.000000 0.535474 0.535983 0.539206 0.554399 0.553095 0.580128 0.594813 0.607089 0.559569 0.623911 0.618011
z5 0.018643 0.897869 -0.142190 0.383733 0.535474 1.000000 0.962696 0.947572 0.959359 0.932164 0.933605 0.919731 0.886981 0.963415 0.904350 0.919598
z10 0.016722 0.903838 -0.150286 0.381237 0.535983 0.962696 1.000000 0.950056 0.965781 0.939212 0.943457 0.930165 0.897377 0.970306 0.915656 0.927912
z15 0.017299 0.903334 -0.154424 0.379681 0.539206 0.947572 0.950056 1.000000 0.962947 0.941839 0.947245 0.936219 0.902245 0.968083 0.919180 0.927662
z20 0.016094 0.929208 -0.164262 0.388801 0.554399 0.959359 0.965781 0.962947 1.000000 0.967556 0.973447 0.965500 0.930205 0.992844 0.951169 0.956477
z25 0.017785 0.920572 -0.164859 0.386208 0.553095 0.932164 0.939212 0.941839 0.967556 1.000000 0.963344 0.955607 0.924137 0.972463 0.946458 0.947570
z30 0.017186 0.947441 -0.166782 0.405421 0.580128 0.933605 0.943457 0.947245 0.973447 0.963344 1.000000 0.980771 0.958459 0.981577 0.974426 0.972102
z35 0.017379 0.954216 -0.166461 0.415887 0.594813 0.919731 0.930165 0.936219 0.965500 0.955607 0.980771 1.000000 0.965028 0.971846 0.984223 0.977729
z40 0.016477 0.948403 -0.161973 0.428671 0.607089 0.886981 0.897377 0.902245 0.930205 0.924137 0.958459 0.965028 1.000000 0.938668 0.969978 0.961539
halbmarathon 0.017168 0.936801 -0.164773 0.391872 0.559569 0.963415 0.970306 0.968083 0.992844 0.972463 0.981577 0.971846 0.938668 1.000000 0.960078 0.964610
netto 0.019194 0.968782 -0.168933 0.438931 0.623911 0.904350 0.915656 0.919180 0.951169 0.946458 0.974426 0.984223 0.969978 0.960078 1.000000 0.988371
brutto 0.019727 0.969400 -0.141555 0.441337 0.618011 0.919598 0.927912 0.927662 0.956477 0.947570 0.972102 0.977729 0.961539 0.964610 0.988371 1.000000
In [16]:
# Summary statistics    
summary_stats = secondary_dataset.describe()
print(f"\nSummary Statistics for Secondary dataset:")
summary_stats
Summary Statistics for Secondary dataset:
Out[16]:
id platz jahrgang ak_plazierung sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
count 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000 34724.000000
mean 20753.856612 17376.021397 1978.902114 1264.282456 9678.847713 1668.268777 3359.813414 5063.242656 6792.635756 8578.037928 10432.448047 12327.157528 14307.832479 7170.832796 15140.201100 15487.002995
std 10030.074395 10031.989265 10.683553 934.498868 6435.072950 282.526120 587.091829 907.490698 1255.230187 1638.801193 2064.926889 2512.857196 2958.812524 1324.205628 3098.147153 3127.074812
min 3381.000000 1.000000 1935.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3591.000000 7269.000000 7269.000000
25% 12068.750000 8688.750000 1971.000000 493.000000 4343.750000 1474.000000 2951.000000 4433.000000 5922.750000 7430.000000 8966.750000 10537.000000 12188.000000 6246.000000 12905.000000 13312.750000
50% 20754.500000 17376.500000 1979.000000 1089.000000 8685.500000 1658.000000 3328.000000 5002.000000 6690.000000 8412.000000 10182.000000 11997.000000 13927.500000 7057.000000 14743.000000 15119.000000
75% 29439.250000 26063.250000 1987.000000 1827.000000 14557.250000 1856.000000 3743.000000 5639.000000 7577.000000 9586.000000 11692.000000 13871.250000 16166.000000 7992.000000 17101.000000 17489.000000
max 38125.000000 34752.000000 2004.000000 3826.000000 23246.000000 4858.000000 6957.000000 9551.000000 12893.000000 16454.000000 19901.000000 23398.000000 26892.000000 13648.000000 28385.000000 28726.000000
In [17]:
# Correlation matrix
correlation_matrix = secondary_dataset.corr(numeric_only=True)
print(f"\nCorrelation Matrix for Secondary dataset:")
correlation_matrix
Correlation Matrix for Secondary dataset:
Out[17]:
id platz jahrgang ak_plazierung sex_plazierung z5 z10 z15 z20 z25 z30 z35 z40 halbmarathon netto brutto
id 1.000000 1.000000 -0.213088 0.559392 0.728643 0.909948 0.923931 0.928243 0.931870 0.943503 0.953334 0.957996 0.963941 0.939431 0.974226 0.971923
platz 1.000000 1.000000 -0.213087 0.559386 0.728640 0.909951 0.923936 0.928250 0.931878 0.943512 0.953345 0.958007 0.963951 0.939440 0.974236 0.971932
jahrgang -0.213088 -0.213087 1.000000 0.128877 -0.255340 -0.199546 -0.208912 -0.212135 -0.213663 -0.212757 -0.214436 -0.210787 -0.209275 -0.214350 -0.216690 -0.217708
ak_plazierung 0.559392 0.559386 0.128877 1.000000 0.787620 0.462249 0.463504 0.462379 0.465644 0.477334 0.491154 0.508157 0.524066 0.469950 0.532678 0.532896
sex_plazierung 0.728643 0.728640 -0.255340 0.787620 1.000000 0.613366 0.620509 0.621853 0.625897 0.639990 0.656769 0.674251 0.690539 0.631944 0.703687 0.703083
z5 0.909948 0.909951 -0.199546 0.462249 0.613366 1.000000 0.989677 0.980028 0.968188 0.958865 0.945307 0.925610 0.914405 0.971864 0.917924 0.919837
z10 0.923931 0.923936 -0.208912 0.463504 0.620509 0.989677 1.000000 0.993414 0.984715 0.977090 0.964838 0.946120 0.934529 0.988732 0.938170 0.939389
z15 0.928243 0.928250 -0.212135 0.462379 0.621853 0.980028 0.993414 1.000000 0.988328 0.984583 0.972775 0.955002 0.943805 0.992872 0.947602 0.947936
z20 0.931870 0.931878 -0.213663 0.465644 0.625897 0.968188 0.984715 0.988328 1.000000 0.986399 0.978385 0.963695 0.952846 0.993169 0.955834 0.955294
z25 0.943503 0.943512 -0.212757 0.477334 0.639990 0.958865 0.977090 0.984583 0.986399 1.000000 0.989390 0.975979 0.966606 0.992985 0.970624 0.969112
z30 0.953334 0.953345 -0.214436 0.491154 0.656769 0.945307 0.964838 0.972775 0.978385 0.989390 1.000000 0.985357 0.977844 0.985655 0.981957 0.979793
z35 0.957996 0.958007 -0.210787 0.508157 0.674251 0.925610 0.946120 0.955002 0.963695 0.975979 0.985357 1.000000 0.981064 0.969767 0.985297 0.982706
z40 0.963941 0.963951 -0.209275 0.524066 0.690539 0.914405 0.934529 0.943805 0.952846 0.966606 0.977844 0.981064 1.000000 0.959579 0.989897 0.987042
halbmarathon 0.939431 0.939440 -0.214350 0.469950 0.631944 0.971864 0.988732 0.992872 0.993169 0.992985 0.985655 0.969767 0.959579 1.000000 0.964508 0.963714
netto 0.974226 0.974236 -0.216690 0.532678 0.703687 0.917924 0.938170 0.947602 0.955834 0.970624 0.981957 0.985297 0.989897 0.964508 1.000000 0.996977
brutto 0.971923 0.971932 -0.217708 0.532896 0.703083 0.919837 0.939389 0.947936 0.955294 0.969112 0.979793 0.982706 0.987042 0.963714 0.996977 1.000000

Data Visualisation¶

1. 'halbmarathon' vs 'netto' - Scatter Plot¶

In [18]:
scatter_fig = px.scatter(primary_dataset, x='halbmarathon', y='netto',
                         title='Scatter Plot - Half Marathon Time vs Finish Time',
                         labels={'halbmarathon': 'Half Marathon Time (seconds)', 'netto': 'Finish Time (seconds)'})
scatter_fig.show()

Summary:

  • The above scatter plot shows a positive correlation between half marathon time and finish time in seconds from 4k to 12k seconds.
  • Faster half marathon times tend to correspond to faster overall finish times.
  • There is variability in this above scatter plot in finish times for a given half marathon timee.

2. 'sex' vs 'netto' - Boxplot¶

In [19]:
color_palette = px.colors.qualitative.Set1
boxplot_sex_fig = px.box(primary_dataset, x='sex', y='netto',
                         title='Boxplot of Finish Time vs Gender (Primary Dataset)',
                         labels={'sex': 'Gender', 'netto': 'Finish Time (seconds)'},
                         color='sex', color_discrete_sequence=color_palette)
boxplot_sex_fig.show()

Summary:

  • The above Boxplot shows comparison of 2013 primary dataset's gender-specific finish time distribution.
  • It gives a brief summary of the variations in finish times by displaying the median, interquartile range, and outliers for both men and women runners.
  • In this statistics, the lower and upper fences are boundaries that can be used to spot possible outliers in a dataset.
  • From this anlysis, Men has the minimum finish time than Women.

3. 'jahrgang' vs 'netto' - Boxplot¶

In [20]:
boxplot_jahrgang_fig = px.box(primary_dataset, x='jahrgang', y='netto',
                              title='Boxplot of Finish Time vs Birth Year (Primary Dataset)',
                              labels={'jahrgang': 'Birth Year', 'netto': 'Finish Time (seconds)'},
                              color='jahrgang', color_discrete_sequence=color_palette)
boxplot_jahrgang_fig.show()

Summary:

  • The above box plot shows distribution of finish times for various birth years from 1930 to 1990.
  • For each birth year category, it displays the median finish time, the interquartile range, which is the spread of the middle 50% of times, and any possible outliers.
  • This makes it possible to compare finish time patterns across various birth years quickly.
  • Here we analysed that as the age increases they require finish time more than the younger runners.
  • There is an obivious trend towards longer finish time netto, particularly as the birth year increases, indicating older age.
  • This above finding is consistent with what is generally known about marathon running, which is that older runners typcally have longer netto time than younger runners.

4. 'ak' vs 'netto' - Boxplot¶

In [21]:
color_palette = px.colors.qualitative.Set1
boxplot_ak_fig = px.box(primary_dataset, x='ak', y='netto',
                        title='Boxplot of Finish Time vs Age Group (Primary Dataset)',
                        labels={'ak': 'Age Group', 'netto': 'Finish Time (seconds)'},
                        color='ak', color_discrete_sequence=color_palette)
boxplot_ak_fig.show()

Summary:

  • The above box plot shows distribution of finish times for different age groups in different colors.
  • It gives us information about the spread (interquartile range), potential outliers, and central tendency (median) for each age group.
  • From this here we compare finish time netto vs different age groups.
  • Here, we understand that finish time starts from 7000 to 25000 seconds VS age.
  • Normally U20 ie. 18-19, H ie. 20-29, 30 to 40 age group people finish the race in minimal time seconds comparatively than others.

5.'nation' vs 'netto' - Boxplot¶

In [23]:
average_time_by_nation = primary_dataset.groupby('nation')['netto'].mean().reset_index()
top_10_nations = average_time_by_nation.nlargest(10, 'netto')
warnings.simplefilter(action='ignore', category=FutureWarning)
plt.figure(figsize=(10, 5))
sns.barplot(x='nation', y='netto', data=top_10_nations, palette='coolwarm')
plt.title('Top 10 Nations with the Highest Average Netto (Finish Time)')
plt.xlabel('Nation')
plt.ylabel('Average Netto (Finish Time)')
plt.show()

Summary:

  • The above box plot shows average distribution of finish times for top 10 nations.
  • This makes it possible to compare the performance of different nations in the main dataset quickly.
  • It contributes to an in-depth study of the dataset and its meaning for marathon participant profiling and race strategy considerations.
  • This graph is a useful for understanding how runners from different nations perform on average in the context of the marathon.
  • From above we can observe that nation FSM runners has max finsh time (24000) seconds than any other nations. While, ZWE nation runners required comparatively less than FSM but required more average finish time (13000) seconds than the other nations.

6. Correlation primary dataset - Heatmap¶

In [24]:
plt.figure(figsize=(12, 8))
sns.heatmap(primary_dataset.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap (Primary Dataset)')
plt.show()

corr_coefficient, p_value = pearsonr(primary_dataset['netto'], primary_dataset['halbmarathon'])
print(f"\nPrimary dataset: \nPearson correlation coefficient between 'netto' and 'halbmarathon': {corr_coefficient:.4f}")

male_finish_times = primary_dataset[primary_dataset['sex'] == 'M']['netto']
female_finish_times = primary_dataset[primary_dataset['sex'] == 'W']['netto']
t_statistic, p_value_ttest = ttest_ind(male_finish_times, female_finish_times)
print(f"\nT-test for 'sex' and 'netto' in Primary dataset:")
print(f"T-statistic: {t_statistic:.4f}")
Primary dataset: 
Pearson correlation coefficient between 'netto' and 'halbmarathon': 0.9601

T-test for 'sex' and 'netto' in Primary dataset:
T-statistic: -56.1901

Summary:

  • The half marathon time, or "halbmarathon," and the finish time, "netto," have a Pearson correlation coefficient of 0.9601. A strong linear relationship between the total finish time and the half marathon time is indicated by this high positive correlation (nearly 1). 'Netto' also tends to increase as 'halbmarathon' increases.

  • The'sex' and 'netto' t-tests compare the times of completion for men and women competitors. -56.1901 is the determined t-statistic. A significant difference in finish times between men and women athletes is indicated by this large negative t-statistic.

  • The primary dataset's numeric variable correlation matrix is displayed visually in the above heatmap.


  1. Positive Correlation:

    • If the correlation coefficient is positive +, it indicates a positive relationship.
    • For example, if 'netto' has a positive correlation with 'halbmarathon', it suggests that as the year of birth ('jahrgang') increases, the overall finish times also tend to increase.
  2. Negative Correlation:

    • If the correlation coefficient is negative -, it indicates a negative relationship.
    • For example, if 'netto' has a negative correlation with 'jahrgang', it suggests that the year of birth ('jahrgang') increases, the overall finish times tend to decrease.
  3. Correlation Strength:

    • The values in the cells show the correlation's strength and direction. Strong correlations are indicated by values near 1 or -1, while weak correlations are indicated by values near 0.

2.2¶

1. Relationship between finish times and pacing at halfway point - Scatter Plot¶

In [25]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='halbmarathon', y='netto', data=primary_dataset)
plt.title("Relationship between Half Marathon Time vs Finish Time - 2013")
plt.xlabel("halbmarathon (Seconds)")
plt.ylabel("netto (Seconds)")

plt.subplot(1, 2, 2)
sns.scatterplot(x='halbmarathon', y='netto', data=secondary_dataset)
plt.title("Relationship between Half Marathon Time vs Finish Time - 2022")
plt.xlabel("halbmarathon (Seconds)")
plt.ylabel("netto (Seconds)")

plt.tight_layout()
plt.show()

Summary:

  1. Positive Trend: When the regression line slopes upward, it indicates that the half marathon time tends to grow on average, along with the finish time.

  2. Strength of association: The strength of the association is indicated by how closely the data points cluster around the regression line. The correlation between the variables is stronger when the cluster of points is narrower.

  3. Outliers: points where the regression line differs significantly from the average. A singular instance that differs from the overall pattern shownm in the data is represented by an outlier.

  4. There is a positive linear correlation, indicating that the trend observed in 2013 persists in 2022.

2. Correlation and coefficients¶

In [26]:
# correlation coefficient func
def calculate_correlation(dataset, x_col, y_col):
    correlation_coefficient, _ = pearsonr(dataset[x_col], dataset[y_col])
    return correlation_coefficient

# linear regression and get coefficients func
def calculate_coefficients(dataset, x_col, y_col):
    X = dataset[[x_col]]
    y = dataset[y_col]

    model = LinearRegression()
    model.fit(X, y)

    intercept = model.intercept_
    coefficient = model.coef_[0]

    return intercept, coefficient

# correlation coefficient and coefficients - primary dataset
correlation_primary = calculate_correlation(primary_dataset, 'halbmarathon', 'netto')
intercept_primary, coefficient_primary = calculate_coefficients(primary_dataset, 'halbmarathon', 'netto')

# correlation coefficient and coefficients - secondary dataset
correlation_secondary = calculate_correlation(secondary_dataset, 'halbmarathon', 'netto')
intercept_secondary, coefficient_secondary = calculate_coefficients(secondary_dataset, 'halbmarathon', 'netto')

print("Primary Dataset:")
print(f"Correlation Coefficient: {correlation_primary:.4f}")
print(f"Intercept: {intercept_primary:.4f}")
print(f"Coefficient: {coefficient_primary:.4f}")

print("\nSecondary Dataset:")
print(f"Correlation Coefficient: {correlation_secondary:.4f}")
print(f"Intercept: {intercept_secondary:.4f}")
print(f"Coefficient: {coefficient_secondary:.4f}")
Primary Dataset:
Correlation Coefficient: 0.9601
Intercept: -685.6404
Coefficient: 2.1784

Secondary Dataset:
Correlation Coefficient: 0.9645
Intercept: -1041.4272
Coefficient: 2.2566

Summary:

  1. correlation :  A linear relationship between two variables' strength and direction are measured by correlation, which has values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).

  2. The correlation coefficient : provides a precise indication of how closely variables move together in a linear fashion by quantifying this relationship numerically.

  • The above calculations conclude both datasets shows a strong positive linear relationship between Half Marathon Time and Finish Time.
  • As Half Marathon Time increases, Finish Time tends to increase, as indicated by high correlation coefficients (0.9601 for the primary dataset and 0.9645 for the secondary dataset).
  • When the half marathon time is zero, the intercept of the regression line indicates the estimated finish time.
  • The linear regression coefficients suggest that, on average, a one-second increase in Half Marathon Time corresponds to an increase in Finish Time of approximately 2.1784 seconds in the primary dataset and 2.2566 seconds in the secondary dataset.

Task 3. Pacing Analysis¶

3.1¶

  1. Even pacing: Runners who maintained a consistent pace throughout the race.
  2. Positive split: Runners who started the race faster than they finished.
  3. Negative split: Runners who started the race slower than they finished.

1. Impact of pacing on 'netto' Finish Time - Violin Plot¶

In [27]:
primary_dataset['netto_first_half'] = primary_dataset['halbmarathon']
primary_dataset['netto_second_half'] = primary_dataset['netto'] - primary_dataset['halbmarathon']

# categorize split type func
def split_type(row):
    if row['netto_first_half'] < row['netto_second_half']:
        return 'Negative Split'
    elif row['netto_first_half'] > row['netto_second_half']:
        return 'Positive Split'
    else:
        return 'Even Split'

primary_dataset['split_type'] = primary_dataset.apply(split_type, axis=1)
plt.figure(figsize=(12, 6))
elite_threshold = primary_dataset['netto'].quantile(0.10)
primary_dataset['is_elite'] = primary_dataset['netto'] <= elite_threshold
plt.subplot(1, 2, 1)
sns.violinplot(x='split_type', y='netto', data=primary_dataset, palette='pastel', inner='quartile')
plt.title("Impact of Pacing on Finish Times")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")

plt.subplot(1, 2, 2)
sns.violinplot(x='split_type', y='netto', hue='is_elite', data=primary_dataset, palette='Set1', split=True, inner='quartile')
plt.title("Impact of Pacing on Finish Times by Elite Status")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")
plt.legend(title='Elite Status')

plt.tight_layout()
plt.show()

Summary:

  1. Left Plot (Impact of Pacing on Finish Times):
  • This plot shows the distribution of netto finish time VS split types (Negative, Positive, Even).
  • with the help of this we figure out the how pacing stratergies affects (netto) finish time.
  • The violin plot's quantiles show the data's distribution and central tendency.
  • Here, I analysed that negative split has the max finish time in second half than the others positive and even split.
  1. Right Plot (Impact of Pacing on Finish Times by Elite Status):
  • This plot includes elite status (top 10% runners) as a factor, expanding on the left plot data.
  • Which enables the impact of pace on finish times to be compared between elite and non-elite runners.
  • The distribution for each split type is highlighted by the split violin plot, which separates elite and non-elite groups.
  • For runners who are elite or not, quantiles offer information about the spread and central tendency.

2. Impact of pacing on 'netto' Finish Time by gender - Strip Plot¶

In [28]:
fig_strip = px.strip(primary_dataset, x='split_type', y='netto', color='sex',
                     title="Impact of Pacing on Finish Times by Gender (Strip Plot)",
                     labels={'split_type': 'Split Type', 'netto_second_half': 'Netto Finish Time (seconds)'},
                     category_orders={'sex': ['MALE', 'FEMALE']})
fig_strip.update_layout(legend=dict(title=dict(text='Gender')))
fig_strip.show()

Summary:

  • The above plot shows the netto finish times for each runner in respect to split types (Negative, Positive, Even) by displaying individual data points as strips.
  • Here we enables a visual comparison of how pace affects both genders' (MEN and WOMEN) finish times.
  • with the help of this we can spotting patterns and trends by examining the density and distribution of data points in primary dataset. 
  • Which gives a natural example of how gender affects the connection between finish times and pacing strategies.

According to this above pattern, runners who accomplish a negative split—running faster in the second half than the first generally have longer netto finish times than those who achieve a positive or even split.

3. Impact of pacing on 'netto' Finish Time by Age group - Box Plot¶

In [29]:
custom_palette = ['#1f78b4', '#33a02c', '#e31a1c', '#ff7f00', '#6a3d9a', '#b15928']
fig_age_group = px.box(primary_dataset, x='split_type', y='netto', color='ak',
                      title="Impact of Pacing on Finish Times by Age Group",
                      labels={'split_type': 'Split Type', 'netto_second_half': 'Netto Finish Time (seconds) for Second Half'},
                      color_discrete_sequence=custom_palette)
fig_age_group.update_layout(legend=dict(title=dict(text='Age Group')))
fig_age_group.show()

Summary:

  • The box plot shows how pace affects finish times by age group.
  • The data is divided into three categories by this method: Even Split, Positive Split, and Negative Split.
  • For the race, the netto finish time is represented on the y-axis.
  • The split types show differences in the distribution of finish times.
  • The median and interquartile range (IQR) for each age group are displayed in a box plot.
  • Looking at the box plots above, which provide insight on how various age groups' finish times are impacted by pacing strategies.

It has been noted that the "Negative Split" category typically has netto finish times that are higher than "Positive Split," which is followed by "Even Split." This suggests that runners who accomplish a negative spli running faster in the second half than the first generally have longer netto finish times than those who achieve a positive split or even split across age groups.

4. Elite and Non Elite by Gender - Pie Chart¶

In [30]:
elite_count = primary_dataset.groupby(['sex', 'is_elite']).size().reset_index(name='count')
elite_count_pivot = elite_count.pivot(index='sex', columns='is_elite', values='count').fillna(0)
labels = ['Non-Elite Men', 'Elite Men', 'Non-Elite Women', 'Elite Women']
colors = ['#ff9999', '#ffcc99', '#66b3ff', '#99ff99']
explode = (0.1, 0, 0.1, 0)

fig, ax = plt.subplots()
ax.pie(elite_count_pivot.values.flatten(), labels=labels, autopct='%1.1f%%', colors=colors, startangle=90, pctdistance=0.85, explode=explode)
ax.axis('equal')
plt.title("Percentage of Elite and Non-Elite Athletes by Gender")
plt.legend(title='Elite Status', loc="upper right", bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()

Summary:

  • The pie chart visually represents the percentage distribution of elite and non-elite athletes across genders M and W (Primary dataset).
  • Only 4% of female athletes are elite, and the remaining 96% are not. There are 9.6% elite male athletes and 90.4% non-elite male athletes.

5. Impact of Pacing 'netto' vs age group and elite status - violin plot¶

In [31]:
def split_type(row):
    if row['netto_first_half'] < row['netto_second_half']:
        return 'Negative Split'
    elif row['netto_first_half'] > row['netto_second_half']:
        return 'Positive Split'
    else:
        return 'Even Split'

primary_dataset['split_type'] = primary_dataset.apply(split_type, axis=1)
fig = px.violin(primary_dataset, x='ak', y='netto', color='is_elite', box=True, points="all",
                title="Impact of Pacing on Finish Times by Age Group and Elite Status",
                labels={'netto': 'Netto Finish Time (seconds)', 'ak': 'Age Group', 'is_elite': 'Elite Status'},
                category_orders={'ak': sorted(primary_dataset['ak'].unique())},
                width=800, height=600,
                facet_col="sex", facet_col_wrap=2,  
                violinmode="overlay", 
                hover_data={'netto': True, 'is_elite': True, 'sex': True, 'ak': True, 'split_type': True},
                template="plotly" 
                )

fig.show()

Summary:

  • The above violin plot, which takes age groups and elite status, graphically shows how pace affects finish times.
  • Violin plots for various age groups are displayed in each subplot, which is arranged by gender.
  • The density of data points is represented by the width of the violin, which covers both elite and non-elite runners.
  • The gender-organized subplots present the data density for both elite and non-elite runners. Interactive hover features enhance data exploration by providing access to detailed insights such as split type, age group, netto finish time, and runner information.
  • By hovering over the visualization, we can explore particular data points, split types, and other details.
  • for eg:

  • Runner - MEN :

    • Elite Status: False
    • Gender: Male (M)
    • Age Group: 35
    • Netto Finish Time: 18.616 k seconds
    • Split Type: Negative
  • Runner - WOMEN :

    • Elite Status: False
    • Gender: Female (W)
    • Age Group: 35
    • Netto Finish Time: 15.914 k seconds
    • Split Type: Negative

Here, we analysed that preference of the impact of Pacing 'netto' vs age group and elite status which depends of the elite runner and here - age H ie. 20-29 age groups people use negative split stratergy to complete the race. The trend of the elite stratergy runners from 20 - 65. And rest of the runners are the non elite.

3.2¶

Comprehensive Analysis of Primary Dataset (2013):¶

  1. Marathon Participant Profiling,
  2. Pacing Strategies and
  3. Performance Dynamics
  • In the above part we have several analysis on the priamry dataset 2013, various visualisation and statastical measures which provides important characteristics of marathon articipants.
  • The dataset contains details on age groupss, gender, elite status, pacing, finish times, and more.
  • Strong positive correlations between the half marathon time and the total finish time were shown by scatter plots.
  • Potential outliers were identified and the finish time distribution was highlighted using boxplots categorized by gender, age group, and country.
  • The effect of pacing strategies on finish times was shown through pacing analysis, which differentiated between negative, positive, and even splits.
  • Examining the distribution of bonking incidents by age and gender provided insight into the performance issues that racers faced.
  • Subsequent investigation revealed variations in finish times and gender distribution between elite and non-elite competitors.
  • The comprehensive examination helps with participant analysing and race strategy considerations in addition to offering an in-depth comprehension of individual factors and adding to the larger context of marathon dynamics.

3.3¶

1. Visualize bonking in the second half across gender - Bar Chart¶

In [32]:
primary_dataset['percentage_slowdown'] = ((primary_dataset['netto_second_half'] - primary_dataset['netto_first_half']) / primary_dataset['netto_first_half']) * 100
bonk_threshold = 30
primary_dataset['has_bonked'] = primary_dataset['percentage_slowdown'] >= bonk_threshold
In [33]:
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='sex', hue='has_bonked', data=primary_dataset, palette='Set2')
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{height}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.title("Bonking Across Gender with Counts")
plt.xlabel("Gender")
plt.ylabel("Count - Runners in race")
plt.show()

Summary:

  • The above code determines how much each runner in the primary dataset has slowed down in their finish times between the first and second halves.
  • It is used to identify runners who noticed a significant drop in pace (bonk) in the second half of at least 30%.
  • The distribution of bonking events by gender is shown graphically in the countplot, which provides view on how this occurrence varies in men and women runners.

  • Here we analysed that Men(980) and women(122) who have the significant drop in pace (bonk) in the second half of at least 30%. Usually men count is high than women count.

2. Visualize bonking in the second half across age group - Stacked bar chart¶

In [34]:
fig = px.histogram(primary_dataset, x='ak', color='has_bonked', barmode='stack',
                   category_orders={'ak': sorted(primary_dataset['ak'].unique())},
                   labels={'ak': 'Age Group', 'has_bonked': 'Bonked'},
                   color_discrete_sequence=['#1f78b4', '#e31a1c'], 
                   title="Bonking Across Age Group",
                   facet_col="sex", facet_col_wrap=2,
                   width=800, height=600)
fig.update_layout(xaxis_title="Age Group", yaxis_title="Count - Runners", legend_title="Bonked",
                  showlegend=True, template="plotly", 
                  plot_bgcolor='rgba(255, 255, 255, 0.7)')  
fig.show()

Summary:

  • The above stacked bar chart graphically represents the count of bonking across different age groups and genders (M,W).
  • Every bar is divided into two colors, one for the number of athletes who were subjected to bonking and another for those who were not.
  • The majority of runners in most age groups did not experience bonking.
  • Here, I analysed that in terms of gender, Men runners are more likely than Women runners to experience bonking.

3. Analyze variation across finish time - Scatter Plot¶

In [35]:
fig = px.scatter(primary_dataset, x='netto', y='percentage_slowdown', color='has_bonked',
                 title="Bonking Across Finish Time",
                 labels={'netto': 'Netto Finish Time (seconds)', 'percentage_slowdown': 'Percentage Slowdown'},
                 color_discrete_map={False: 'blue', True: 'red'})
fig.show()

Summary:

  • The above scatter plot shows the relationship between netto finish time and percentage slowdown.
  • A runner is represented by each point, and the color indicates whether they were bonked (shown in red) or not (shown in blue).
  • This is used for data visualization in spotting possible patterns, trends, or outliers in the primary data.
  • Every dot on the plot represents a single runner , and the color designates whether or not  they were bonked (red) or not (blue).
  • The blue point concentration between -50% and 50% indicates that most runners who were not bonked slowed down in this range.

  • Here, I conclude that most bonked runners (shown in red) showed a slowdown between 30% and 50% OF their previous pace.

4. Second half with markers for bonked runners - Scatter Plot¶

In [36]:
# Scatter plot for the first half
fig = go.Figure()

fig.add_trace(go.Scatter(x=primary_dataset[primary_dataset['sex'] == 'M']['netto_first_half'],
                         y=primary_dataset[primary_dataset['sex'] == 'M']['netto_second_half'],
                         mode='markers', name='Male', marker=dict(color='#1f78b4')))
fig.add_trace(go.Scatter(x=primary_dataset[primary_dataset['sex'] == 'W']['netto_first_half'],
                         y=primary_dataset[primary_dataset['sex'] == 'W']['netto_second_half'],
                         mode='markers', name='Female', marker=dict(color='#33a02c')))

fig.update_layout(title="First Half vs. Second Half by Gender",
                  xaxis_title="Netto Finish Time - First Half (seconds)",
                  yaxis_title="Netto Finish Time - Second Half (seconds)",
                  template="plotly", width=800, height=400)

fig.show()

# Scatter plot for the second half with bonked runners
fig_bonked = go.Figure()

fig_bonked.add_trace(go.Scatter(x=primary_dataset[~primary_dataset['has_bonked']]['netto_first_half'],
                                y=primary_dataset[~primary_dataset['has_bonked']]['netto_second_half'],
                                mode='markers', name='Not Bonked', marker=dict(color='#1f78b4')))
fig_bonked.add_trace(go.Scatter(x=primary_dataset[primary_dataset['has_bonked']]['netto_first_half'],
                                y=primary_dataset[primary_dataset['has_bonked']]['netto_second_half'],
                                mode='markers', name='Bonked', marker=dict(color='#e31a1c')))

fig_bonked.update_layout(title="First Half vs. Second Half with Bonked Runners Marked",
                          xaxis_title="Netto Finish Time - First Half (seconds)",
                          yaxis_title="Netto Finish Time - Second Half (seconds)",
                          template="plotly", width=800, height=400)

fig_bonked.show()

Summary:

  1. First Half vs. Second Half by Gender (M,W):

    • The netto finish times of the first and second halves, broken down by gender, are compared in the 1st plot.
    • Consistent pacing between the two halves is indicated by points that fall on or near the diagonal line.
    • Clusters that are away the diagonal which suggest different pacing.
  2. First Half vs. Second Half with Bonked Runners Marked:

    • The analysis is further expanded in the 2nd plot, which identifies runners who significantly slowed down in the second half as "bonked."
    • A 'hover' indicates runners who slowed by at least 30% during the second half of the race.
    • This facilitates the choice of whether bonking is linked to particular first half patterns.

These visuals help to identify bonking times as well as understand the pacing patterns between halves, which can offer insight into performance-influencing variables. The above scatter plots show how runners perform in the first and second halves, with the first plotting gender differences and the second plotting runners who were bonked.

Task - 4¶

1. 'halbmarathon' vs 'netto' Finish Time (Secondary Dataset) - Scatter Plot¶

In [37]:
scatter_fig = px.scatter(secondary_dataset, x='halbmarathon', y='netto',
                         title='Scatter Plot - Half Marathon Time vs Finish Time (Secondary dataset)',
                         labels={'halbmarathon': 'Half Marathon Time (seconds)', 'netto': 'Finish Time (seconds)'})
scatter_fig.show()

Summary:

  • Here, I am using Scatter Plot like previously used in task 2 on primary dataset to see if there is a pattern or correlation between the half marathon time and the total finish time for runners in the secondary dataset.

  • This is uelpful for getting a better understanding of the structure of the dataset by pointing out possible linear relationships, outliers, or patterns of data points around particular areas.

  • Primary dataset represent the data for 2013 which has half marathon time seconds from 4k to 12k and here in secondary dataset starts from 4k to 14k comparatively increased by 2k difference.

Hence, to see the main findings from task 2 for this analysis hold in simillar trend for this secondary 2022 dataset.

2. 'pacing seconds' vs 'ak' - Box Plot¶

In [38]:
secondary_dataset['pacing_seconds'] = secondary_dataset['netto'] - secondary_dataset['halbmarathon']
fig = px.box(secondary_dataset, x='ak', y='pacing_seconds', color='sex',
             title='Pacing Analysis by Gender and Age Group (Secondary Dataset)')
fig.update_layout(xaxis_title='Age Group', yaxis_title='Pacing (seconds)', legend_title='Gender')
fig.show()

Summary:

  • The above boxplot shows comparison of 2022 secondary dataset's gender-specific finish time distribution.
  • As earlier we seen this above for primary dataset in task 2, which gives a summary of the variations in finish times by displaying the median, interquartile range, and outliers for both men and women runners.
  • From this anlysis, Men has the minimum finish time than Womem in their all age groups.

To conclude that the both primary and secondary dataset hold simillar pattermn extraction from their data itself.

3. Distribution Pacing Seconds in Secondary Dataset - Histogram Plot¶

In [39]:
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=['Primary 2013', 'Secondary 2022'])
primary_dataset['pacing_seconds'] = primary_dataset['netto'] - primary_dataset['halbmarathon']
fig_primary = px.histogram(primary_dataset, x='pacing_seconds', nbins=20)
secondary_dataset['pacing_seconds'] = secondary_dataset['netto'] - secondary_dataset['halbmarathon']
fig_secondary = px.histogram(secondary_dataset, x='pacing_seconds', nbins=20)
fig.add_trace(fig_primary['data'][0], row=1, col=1)
fig.add_trace(fig_secondary['data'][0], row=1, col=2)
fig.update_layout(title_text='Comparison of Pacing Times Between Datasets', showlegend=False)
fig.update_xaxes(title_text='Pacing Time (seconds)', row=1, col=1)
fig.update_yaxes(title_text='Count - Runners', row=1, col=1)
fig.show()

Summary:

  • The histogram displays the distribution of pacing times in the primary and second dataset.
  • Pacing times represent the difference between the net finish time of the second half and the first half of the race for each participant.
  • This above histogram provides insights into the frequency and spread of pacing times, helping to identify patterns or trends in how athletes manage their pace during the race.
  • The x-axis represents the pacing times, divided into 20 bins, and the y-axis indicates the count of participants within each bin.

Here we analysed that most of the part in above graph shows the same holding points with each other dataset records for year 2013 and 2022; Only difference from 10,000 Pacing Time seconds.

4. Impact of pacing on 'netto' Finish Time - Violin Plot¶

In [40]:
secondary_dataset['netto_first_half'] = secondary_dataset['halbmarathon']
secondary_dataset['netto_second_half'] = secondary_dataset['netto'] - secondary_dataset['halbmarathon']
def split_type(row):
    if row['netto_first_half'] < row['netto_second_half']:
        return 'Negative Split'
    elif row['netto_first_half'] > row['netto_second_half']:
        return 'Positive Split'
    else:
        return 'Even Split'

secondary_dataset['split_type'] = secondary_dataset.apply(split_type, axis=1)
plt.figure(figsize=(12, 6))
elite_threshold = secondary_dataset['netto'].quantile(0.10)
secondary_dataset['is_elite'] = secondary_dataset['netto'] <= elite_threshold
plt.subplot(1, 2, 1)
sns.violinplot(x='split_type', y='netto', data=secondary_dataset, palette='pastel', inner='quartile')
plt.title("Impact of Pacing on Finish Times")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")

plt.subplot(1, 2, 2)
sns.violinplot(x='split_type', y='netto', hue='is_elite', data=secondary_dataset, palette='Set1', split=True, inner='quartile')
plt.title("Impact of Pacing on Finish Times by Elite Status")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")
plt.legend(title='Elite Status')

plt.tight_layout()
plt.show()

Summary:

  1. Left Plot (Impact of Pacing on Finish Times):
  • This plot shows the distribution of netto finish time VS split types (Negative, Positive, Even).
  • with the help of this we figure out the how pacing stratergies affects (netto) finish time.
  • The violin plot's quantiles show the data's distribution and central tendency.
  1. Right Plot (Impact of Pacing on Finish Times by Elite Status):
  • This plot includes elite status (top 10% runners) as a factor, expanding on the left plot data.
  • Which enables the impact of pace on finish times to be compared between elite and non-elite runners.
  • The distribution for each split type is highlighted by the split violin plot, which separates elite and non-elite groups.
  • For runners who are elite or not, quantiles offer information about the spread and central tendency.

This second dataset 2022 hold the approximately same analysis as we have created above for the primary dataset 2013. There is only slight difference in the Even, negative, Positive split.

5. Elite and Non Elite by Gender - Pie Chart¶

In [41]:
elite_count = secondary_dataset.groupby(['sex', 'is_elite']).size().reset_index(name='count')
elite_count_pivot = elite_count.pivot(index='sex', columns='is_elite', values='count').fillna(0)
labels = ['Non-Elite Men', 'Elite Men', 'Non-Elite Women', 'Elite Women']
colors = ['#ff9999', '#ffcc99', '#66b3ff', '#99ff99']
explode = (0.1, 0, 0.1, 0)

fig, ax = plt.subplots()
ax.pie(elite_count_pivot.values.flatten(), labels=labels, autopct='%1.1f%%', colors=colors, startangle=90, pctdistance=0.85, explode=explode)
ax.axis('equal')
plt.title("Percentage of Elite and Non-Elite Athletes by Gender")
plt.legend(title='Elite Status', loc="upper right", bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()

Summary:

From Task 2, to gathering records for 2013 primary dataset:

  • Elite Athletes:

    • 2022: Elite women comprise 0.9%, and elite men comprise 9.1%.
    • 2013: Elite women make up 0.4%, and elite men make up 9.6%.
  • Non-Elite Athletes:

    • 2022: Non-elite women represent 32.2%, and non-elite men represent 57.8%.
    • 2013: Non-elite women account for 24.1%, and non-elite men account for 65.9%.

Men are more likely than women to participate in the non-elite category in both datasets. The distribution of participants in the elite category remains consistent, with a small percentage of participants.

6. Visualize bonking in the second half across gender - Bar Chart¶

In [42]:
secondary_dataset['percentage_slowdown'] = ((secondary_dataset['netto_second_half'] - secondary_dataset['netto_first_half']) / secondary_dataset['netto_first_half']) * 100
bonk_threshold = 30
secondary_dataset['has_bonked'] = secondary_dataset['percentage_slowdown'] >= bonk_threshold

plt.figure(figsize=(8, 6))
ax = sns.countplot(x='sex', hue='has_bonked', data=secondary_dataset, palette='Set2')
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{height}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.title("Bonking Across Gender with Counts")
plt.xlabel("Gender")
plt.ylabel("Count - Runners")
plt.show()

Summary:

From Task 3, to gathering records for 2013 primary dataset:

2013 Dataset:

* Men:
* True (Elite): 980
* False (Non-elite): 26,539
* Women:
* True (Elite): 122
* False (Non-elite): 882

2022 Dataset:

* Men:
* True (Elite): 1,841
* False (Non-elite): 21,389
* Women:
* True (Elite): 353
* False (Non-elite): 11,141

  • Both in 2013 and 2022, the majority of participants are non-elite men, with a higher count in 2013.
  • The count of elite participants (both men and women) is higher in 2022 than in 2013.
  • The count of non-elite women is higher in 2022 compared to 2013.

Generally, the dataset for 2022 seems to have a larger number of participants, with an increase in both elite and non-elite categories.


Conclusion :¶

To summarize, the above analysis of the marathon datasets from 2013 and 2022 provides significant understandings into participant characteristics, the impact of pace on netto finish times, and the dynamics of performannce. A few interesting finds are the impact of pacing strategies, the differences in finish times between age groups and genders as represented by box and viiolin plots, and the distribution of elte and non-elite athlete as shown by pie charts. It has been determined that negative splits extend netto finish times. Gender differences in significant second-half pace drops are shown by the bonking analysis. Pattern recognition is helped by the effective visualization of relationships through scatter plots, such as half marathon time versus netto finish time. The 2022 dataset shows a higher number of participants, while participant demographics and performance trends are consistent across datasets. With a accurate grasp of marathon dynamics provided by these analyses, race strategy decisions for both years can be made with knowledge.